TODO 1.1: Information Visualization on Global Air Quality Data by WHO
Let's start with importing some libraries first!
import pandas as pd #Library for manipulating our dataset
import numpy as np #Library for mathematical functions
import seaborn as sns #Library for visualizing statistical graphs.
import matplotlib.pyplot as plt #Library for basic visualization graphs and do more customizations.
from matplotlib.ticker import MaxNLocator
import re
import squarify
my_cat_palette = sns.color_palette('Set2')
my_cont_palette = sns.color_palette('Blues')
sns.set_context('notebook')
Let's import our first dataset.
who_air_data = pd.read_csv('/Users/yugaljagtap/Downloads/Infovis/who_aap_2021_v9_11august2022.csv', sep=';', decimal=',')
who_air_data.head() #shows the first 5 rows of the dataset
| | WHO Region | ISO3 | WHO Country Name | City or Locality | Measurement Year | PM2.5 (μg/m3) | PM10 (μg/m3) | NO2 (μg/m3) | PM25 temporal coverage (%) | PM10 temporal coverage (%) | NO2 temporal coverage (%) | Reference | Number and type of monitoring stations | Version of the database |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Eastern Mediterranean Region | AFG | Afghanistan | Kabul | 2019 | 119.77 | NaN | NaN | 18.0 | NaN | NaN | U.S. Department of State, United States Enviro... | NaN | 2022 |
| 1 | European Region | ALB | Albania | Durres | 2015 | NaN | 17.65 | 26.63 | NaN | NaN | 83.961187 | European Environment Agency (downloaded in 2021) | NaN | 2022 |
| 2 | European Region | ALB | Albania | Durres | 2016 | 14.32 | 24.56 | 24.78 | NaN | NaN | 87.932605 | European Environment Agency (downloaded in 2021) | NaN | 2022 |
| 3 | European Region | ALB | Albania | Elbasan | 2015 | NaN | NaN | 23.96 | NaN | NaN | 97.853881 | European Environment Agency (downloaded in 2021) | NaN | 2022 |
| 4 | European Region | ALB | Albania | Elbasan | 2016 | NaN | NaN | 26.26 | NaN | NaN | 96.049636 | European Environment Agency (downloaded in 2021) | NaN | 2022 |
who_air_data.tail() #shows the last 5 rows of the dataset
| | WHO Region | ISO3 | WHO Country Name | City or Locality | Measurement Year | PM2.5 (μg/m3) | PM10 (μg/m3) | NO2 (μg/m3) | PM25 temporal coverage (%) | PM10 temporal coverage (%) | NO2 temporal coverage (%) | Reference | Number and type of monitoring stations | Version of the database |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32186 | African Region | ZAF | South Africa | West Coast | 2015 | 7.47 | 24.64 | 7.64 | 75.0 | 75.0 | 75.0 | South African Air Quality Information System | 3 Residential-Medium/Upper income | 2022 |
| 32187 | African Region | ZAF | South Africa | West Coast | 2016 | 8.42 | 33.28 | 7.27 | 75.0 | 75.0 | 75.0 | South African Air Quality Information System | 2 Residential-Medium/Upper income | 2022 |
| 32188 | African Region | ZAF | South Africa | West Coast | 2017 | 6.83 | 20.49 | 8.72 | 75.0 | 75.0 | 75.0 | South African Air Quality Information System | 2 Residential-Medium/Upper income | 2022 |
| 32189 | African Region | ZAF | South Africa | West Coast | 2018 | 6.10 | 17.99 | 7.15 | 75.0 | 75.0 | 75.0 | South African Air Quality Information System | 2 Residential-Medium/Upper income | 2022 |
| 32190 | African Region | ZAF | South Africa | West Rand | 2016 | NaN | NaN | 17.85 | NaN | NaN | 75.0 | South African Air Quality Information System | 1 N/A | 2022 |
who_air_data.columns #shows the column names to make it easy for copy pasting
Index(['WHO Region', 'ISO3', 'WHO Country Name', 'City or Locality',
'Measurement Year', 'PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)',
'PM25 temporal coverage (%)', 'PM10 temporal coverage (%)',
'NO2 temporal coverage (%)', 'Reference',
'Number and type of monitoring stations', 'Version of the database'],
dtype='object')
Now let's import our second data set
who_region_income = pd.read_csv('/Users/yugaljagtap/Downloads/Infovis/who_country_income_ratings.csv', sep=';')
who_region_income.head()
| | WHO Country Name | WHO Region | World Bank ranking of income 2019 |
|---|---|---|---|
| 0 | Afghanistan | Eastern Mediterranean | low |
| 1 | Albania | European | upper middle |
| 2 | Algeria | African | lower middle |
| 3 | Andorra | European | high |
| 4 | Angola | African | lower middle |
who_region_income.tail()
| | WHO Country Name | WHO Region | World Bank ranking of income 2019 |
|---|---|---|---|
| 189 | Venezuela (Bolivarian Republic of) | Americas | upper middle |
| 190 | Viet Nam | Western Pacific | lower middle |
| 191 | Yemen | Eastern Mediterranean | low |
| 192 | Zambia | African | lower middle |
| 193 | Zimbabwe | African | lower middle |
who_region_income.columns
Index(['WHO Country Name', 'WHO Region', 'World Bank ranking of income 2019'], dtype='object')
Let's Plot the Regions That Produce the Most NO2
sns.set_palette(my_cat_palette)
sns.set_style('whitegrid')
plt.figure(figsize=(12,5))
sns.barplot(data=who_air_data, x="WHO Region", y="NO2 (μg/m3)", errorbar="sd", hue="WHO Region")
plt.title('NO2 per WHO region')
sns.despine()
plt.tight_layout()
plt.xlabel('WHO Region')
plt.ylabel('NO2 (μg/m3)')
plt.show()
This chart shows the Eastern Mediterranean Region with the highest mean NO2 level, at approximately 45 μg/m³. One plausible explanation is the region's abundant oil resources, which encourage fossil fuel use and therefore NO2 emissions from transportation and industry; the chart itself does not establish the cause.
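Values read off a bar chart are approximate, so it can help to compute the per-region statistics directly. The following is a minimal sketch, assuming a frame with the same `WHO Region` and `NO2 (μg/m3)` columns as `who_air_data`; the helper name `no2_by_region` and the rows in `toy` are made up for illustration and are not taken from the WHO file.

```python
import pandas as pd

def no2_by_region(data: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation and non-null row count of NO2 per WHO region."""
    return (
        data.groupby("WHO Region")["NO2 (μg/m3)"]
        .agg(["mean", "std", "count"])
        .sort_values("mean", ascending=False)
    )

# Illustrative rows only (not real measurements).
toy = pd.DataFrame({
    "WHO Region": ["A", "A", "B", "B", "B"],
    "NO2 (μg/m3)": [40.0, 50.0, 10.0, 20.0, None],
})
region_summary = no2_by_region(toy)
print(region_summary)
```

With the real data the call would be `no2_by_region(who_air_data)`; the `count` column also shows how many measurements back each bar.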
Now let's examine whether the income level of a region is related to its air pollution.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), sharey=False)
# Bar plot for NO2 pollution
sns.barplot(data=who_air_data, x="WHO Region", y="NO2 (μg/m3)", errorbar="sd", hue="WHO Region", legend=False,ax=ax1)
ax1.set_title('NO2 per WHO Region')
ax1.set_xlabel('WHO Region')
ax1.set_ylabel('NO2 (μg/m3)')
ax1.tick_params(axis='x', rotation=45)
sns.despine(ax=ax1)
# Count plot for income distribution
region_order = ['Eastern Mediterranean','European','Americas','Western Pacific','South-East Asian','African']
sns.countplot(x='WHO Region', hue='World Bank ranking of income 2019', data=who_region_income,
order=region_order, ax=ax2)
ax2.set_title('Distribution of WHO Regions by World Bank Income Ranking')
ax2.set_xlabel('WHO Region')
ax2.set_ylabel('Number of Countries')
ax2.tick_params(axis='x', rotation=45)
ax2.legend(title='Income Ranking')
plt.tight_layout()
plt.show()
Based on the charts, the relationship between income and NO2 pollution across regions is not clear-cut. Europe, with many high and upper-middle-income countries, has low NO2 (around 20 µg/m³), while the Eastern Mediterranean, with fewer high-income countries, shows higher NO2 (roughly 45 to 50 µg/m³), which suggests a possible inverse trend. However, Africa’s moderate NO2 (20–30 µg/m³) despite low income indicates that income isn’t the sole factor. Other influences, such as industrial activity and national policies, may play a role.
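Comparing two side-by-side charts only relates income and pollution at region level. A more direct check is to join the two tables and average NO2 per income group. The sketch below assumes the column names shown earlier; it joins on `WHO Country Name` because the region labels differ between the files ('European Region' versus 'European'). The helper name `mean_no2_by_income` and all toy rows are hypothetical.

```python
import pandas as pd

INCOME_COL = "World Bank ranking of income 2019"

def mean_no2_by_income(air: pd.DataFrame, income: pd.DataFrame) -> pd.Series:
    """Average NO2 per World Bank income group, joined on country name."""
    # Average per country first so countries with many cities do not dominate.
    per_country = air.groupby("WHO Country Name", as_index=False)["NO2 (μg/m3)"].mean()
    merged = per_country.merge(
        income[["WHO Country Name", INCOME_COL]],
        on="WHO Country Name",
        how="inner",  # countries missing from either table are dropped
    )
    return merged.groupby(INCOME_COL)["NO2 (μg/m3)"].mean()

# Made-up rows for illustration.
toy_air = pd.DataFrame({
    "WHO Country Name": ["X", "X", "Y", "Z"],
    "NO2 (μg/m3)": [10.0, 30.0, 40.0, 60.0],
})
toy_income = pd.DataFrame({
    "WHO Country Name": ["X", "Y", "Z"],
    INCOME_COL: ["high", "low", "low"],
})
income_means = mean_no2_by_income(toy_air, toy_income)
print(income_means)
```

With the notebook's frames this would be `mean_no2_by_income(who_air_data, who_region_income)`.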
Let's examine the top 10 most polluted countries for each type of pollution.
country_means = who_air_data.groupby('WHO Country Name')[['PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)']].mean().reset_index()
# Sorting by PM2.5 (Most harmful pollutant) for getting top 10 countries
top_10 = country_means.sort_values(by='PM2.5 (μg/m3)', ascending=False).head(10)
# Melt the DataFrame to long format for Seaborn
top_10_melted = pd.melt(
top_10,
id_vars=['WHO Country Name'],
value_vars=['PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)'],
var_name='Pollutant',
value_name='Concentration'
)
# Create a bar plot with Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(
data=top_10_melted,
x='WHO Country Name',
y='Concentration',
hue='Pollutant',
palette=['#FF6384', '#36A2EB', '#4BC0C0'] # Colors for PM2.5, PM10, NO2
)
# Customize the plot
plt.title('Top 10 Most Polluted Countries by PM2.5, PM10, and NO2', fontsize=14)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Concentration (μg/m³)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.legend(title='Pollutant')
plt.tight_layout()
# Show the plot
plt.show()
The chart is sorted by PM2.5, the most harmful of the three pollutants. Afghanistan leads with the highest PM2.5 concentration (around 110 μg/m³), followed by Cameroon and Bangladesh (both near 90 μg/m³). PM10 levels are considerably higher than PM2.5 levels across the countries shown, with Pakistan peaking at 350 μg/m³, indicating widespread coarse particulate pollution.
Let's examine the 10 least polluted countries for each type of pollution.
country2_means = who_air_data.groupby('WHO Country Name')[['PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)']].mean().reset_index()
# Sorting by PM2.5 (most harmful pollutant) in ascending order to get the 10 least polluted countries
top_10 = country2_means.sort_values(by='PM2.5 (μg/m3)', ascending=True).head(10)
# Melt the DataFrame to long format for Seaborn
top_10_melted = pd.melt(
top_10,
id_vars=['WHO Country Name'],
value_vars=['PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)'],
var_name='Pollutant',
value_name='Concentration'
)
# Create a bar plot with Seaborn
plt.figure(figsize=(12, 6))
sns.barplot(
data=top_10_melted,
x='WHO Country Name',
y='Concentration',
hue='Pollutant',
palette=['#FF6384', '#36A2EB', '#4BC0C0'] # Colors for PM2.5, PM10, NO2
)
# Customize the plot
plt.title('Top 10 Least Polluted Countries by PM2.5, PM10 and NO2', fontsize=14)
plt.xlabel('Countries', fontsize=12)
plt.ylabel('Concentration (μg/m³)', fontsize=12)
plt.xticks(rotation=50, ha='right')
plt.legend(title='Pollutant')
plt.tight_layout()
# Show the plot
plt.show()
The chart shows the 10 least polluted countries, ranked by mean PM2.5, together with their PM10 and NO2 levels. The Bahamas has the lowest PM2.5 at ~4 µg/m³, below the WHO guideline of 5 µg/m³; the others range from 5.5 to 8 µg/m³, slightly above it. PM10 levels are 10–20 µg/m³ (WHO guideline: 15 µg/m³), and NO2 levels are 5–15 µg/m³ (WHO guideline: 10 µg/m³).
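The comparison against the guideline values can also be made explicit in code. This is a minimal sketch that uses the three guideline numbers quoted in the text; the helper name `exceeds_guideline` and the two toy countries are hypothetical.

```python
import pandas as pd

# Guideline values quoted in the text (μg/m³).
WHO_GUIDELINES = {"PM2.5 (μg/m3)": 5.0, "PM10 (μg/m3)": 15.0, "NO2 (μg/m3)": 10.0}

def exceeds_guideline(means: pd.DataFrame) -> pd.DataFrame:
    """Boolean frame: True where a country's mean is strictly above the guideline."""
    limits = pd.Series(WHO_GUIDELINES)
    # Note: a missing mean (NaN) compares as False, i.e. "not shown to exceed".
    return means[list(WHO_GUIDELINES)].gt(limits, axis="columns")

# Made-up country means for illustration.
toy_means = pd.DataFrame(
    {
        "PM2.5 (μg/m3)": [4.0, 8.0],
        "PM10 (μg/m3)": [12.0, 20.0],
        "NO2 (μg/m3)": [10.0, 11.0],
    },
    index=["Country A", "Country B"],
)
flags = exceeds_guideline(toy_means)
print(flags)
```

Applied to the notebook's data, the input would be the country means indexed by `WHO Country Name`, for example `top_10.set_index('WHO Country Name')`.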
Let's look at how NO2 pollution developed over time in Fingal.
# Filter data for Fingal
fingal_data = who_air_data[who_air_data['City or Locality'].str.contains('Fingal', case=False, na=False)]
fingal_data = fingal_data[fingal_data['Measurement Year'] <= 2018]
# Sort data by Measurement Year for chronological order
fingal_data_sorted = fingal_data.sort_values(by='Measurement Year')
# Create the line chart
plt.figure(figsize=(10, 5))
plt.plot(fingal_data_sorted['Measurement Year'], fingal_data_sorted['NO2 (μg/m3)'],
marker='o', linestyle='-', color='r', label='NO2 (μg/m³)')
plt.title('NO2 Pollution in Fingal Over Time', fontsize=14)
plt.xlabel('Measurement Year', fontsize=12)
plt.ylabel('NO2 Concentration (μg/m³)', fontsize=12)
plt.grid(True)
plt.legend()
plt.show()
The chart shows NO2 pollution in Fingal from 2013 to 2018. Levels dropped from 27.5 µg/m³ in 2013 to 12.5 µg/m³ in 2015, then rose to 30 µg/m³ in 2016, dipped to 27.5 µg/m³ in 2017, and peaked at 32.5 µg/m³ in 2018.
Let's Examine If Population Is Related to Pollution in the Top 10 Most Populated Countries.
countries = ['India', 'China', 'United States', 'Indonesia', 'Pakistan',
'Nigeria', 'Brazil', 'Bangladesh', 'Russia', 'Mexico']
# Filtering the dataset for the 10 countries
filtered_data = who_air_data[who_air_data['WHO Country Name'].isin(countries)]
# Group by country and calculate mean for each pollutant
country_means = filtered_data.groupby('WHO Country Name')[['PM2.5 (μg/m3)', 'PM10 (μg/m3)', 'NO2 (μg/m3)']].mean().reset_index()
# Creating a categorical column for ordering
country_means['Country'] = pd.Categorical(country_means['WHO Country Name'],
categories=countries, ordered=True)
country_means = country_means.sort_values('Country')
# Creating a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
# Scatter plot for PM2.5
sns.scatterplot(data=country_means, x='Country', y='PM2.5 (μg/m3)', ax=ax1, color='b', s=100)
ax1.set_title('PM2.5 Levels', fontsize=12)
ax1.set_xlabel('Country (Ordered by Population)', fontsize=10)
ax1.set_ylabel('Concentration (μg/m³)', fontsize=10)
ax1.grid(True)
ax1.tick_params(axis='x', rotation=45)
# Scatter plot for NO2
sns.scatterplot(data=country_means, x='Country', y='NO2 (μg/m3)', ax=ax2, color='r', s=100)
ax2.set_title('NO2 Levels', fontsize=12)
ax2.set_xlabel('Country (Ordered by Population)', fontsize=10)
ax2.set_ylabel('') # Shared y-axis
ax2.grid(True)
ax2.tick_params(axis='x', rotation=45)
# Adjusting layout and title
plt.suptitle('Pollution Levels Across 10 Countries (Ordered by Population)', fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
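Filtering with `isin` silently drops any requested name that is spelled differently in the data. The WHO tables above use official names such as 'Viet Nam' and 'Venezuela (Bolivarian Republic of)', so short forms like 'United States' or 'Russia' may not match. The sketch below reports unmatched names; the helper `missing_names` is hypothetical, and the spellings in `toy_names` are an assumption for illustration that should be checked against the real `WHO Country Name` column.

```python
import pandas as pd

def missing_names(wanted: list[str], available: pd.Series) -> list[str]:
    """Return the requested names that never occur in `available`."""
    present = set(available.dropna().unique())
    return [name for name in wanted if name not in present]

# Assumed spellings, for illustration only.
toy_names = pd.Series(["India", "United States of America", "Russian Federation"])
not_found = missing_names(["India", "United States", "Russia"], toy_names)
print(not_found)
```

With the notebook's variables the check would be `missing_names(countries, who_air_data['WHO Country Name'])`; an empty list means every country in the scatter plots was actually found.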
TODO 1.2: Information Visualization on How the Sleep Health is related to Lifestyle¶
Now let's start by importing our dataset.
df = pd.read_csv("/Users/yugaljagtap/Downloads/Infovis/Sleep_health_and_lifestyle_dataset.csv")
df.head() #Shows the first 5 rows of the dataset
| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NaN |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
df.tail() #Shows the last 5 rows of the dataset
| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 369 | 370 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 370 | 371 | Female | 59 | Nurse | 8.0 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 371 | 372 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 372 | 373 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 373 | 374 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
df.columns # This prints the column names to make copy-pasting easy
Index(['Person ID', 'Gender', 'Age', 'Occupation', 'Sleep Duration',
'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
'BMI Category', 'Blood Pressure', 'Heart Rate', 'Daily Steps',
'Sleep Disorder'],
dtype='object')
df.isnull().sum() # This gives the number of null values present in each column.
Person ID                    0
Gender                       0
Age                          0
Occupation                   0
Sleep Duration               0
Quality of Sleep             0
Physical Activity Level      0
Stress Level                 0
BMI Category                 0
Blood Pressure               0
Heart Rate                   0
Daily Steps                  0
Sleep Disorder             219
dtype: int64
df['Sleep Disorder'].unique() # See which types of sleep disorder are present in the dataset
array([nan, 'Sleep Apnea', 'Insomnia'], dtype=object)
Let's convert the null values to the label 'None' (no sleep disorder)
df['Sleep Disorder'] = df['Sleep Disorder'].fillna('None')
print(df['Sleep Disorder'].value_counts())
Sleep Disorder
None           219
Sleep Apnea     78
Insomnia        77
Name: count, dtype: int64
df.isnull().sum() # we got rid of all the null values
Person ID                  0
Gender                     0
Age                        0
Occupation                 0
Sleep Duration             0
Quality of Sleep           0
Physical Activity Level    0
Stress Level               0
BMI Category               0
Blood Pressure             0
Heart Rate                 0
Daily Steps                0
Sleep Disorder             0
dtype: int64
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Occupation', hue='Gender',palette={'Male': 'black', 'Female': 'pink'})
plt.title('Proportion of Gender by Occupation')
plt.xlabel('Occupation')
plt.xticks(rotation=90)
plt.ylabel('Count')
plt.legend(title='Gender')
plt.show()
In this chart, we can see that in the medical field doctors are predominantly male, as indicated by the taller male bar, while nurses are mostly female, with a significantly taller female bar.
sns.scatterplot(df,x='Occupation',y='Sleep Duration',hue='Sleep Disorder')
plt.title('Scatter Plot of Occupation vs. Sleep Duration')
plt.xlabel('Occupation')
plt.xticks(rotation=90)
plt.ylabel('Sleep Duration')
plt.legend(title='Sleep Disorder', framealpha=0.5, bbox_to_anchor=(1.05, 1), loc='upper left') # Move legend outside
plt.tight_layout()
plt.show()
Plotting the Proportion of Sleep Disorders by Gender Using Two Chart Types: Count Plot and Pie Charts
sns.set_style("whitegrid")
male_data = df[df['Gender'] == 'Male']['Sleep Disorder'].value_counts()
female_data = df[df['Gender'] == 'Female']['Sleep Disorder'].value_counts()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
# Male pie chart
ax1.pie(male_data, labels=male_data.index, autopct='%1.1f%%', colors=my_cat_palette[:len(male_data)])
ax1.set_title('Sleep Disorders (Male)')
# Female pie chart
ax2.pie(female_data, labels=female_data.index, autopct='%1.1f%%', colors=my_cat_palette[:len(female_data)])
ax2.set_title('Sleep Disorders (Female)')
plt.tight_layout()
sns.set_style("whitegrid")
plt.figure(figsize=(8, 8))
sns.countplot(data=df, x='Gender', hue='Sleep Disorder')
# Customize the plot
plt.title('Proportion of Sleep Disorders by Gender')
plt.xlabel('Genders')
plt.ylabel('Count')
plt.legend(title='Sleep Disorder')
plt.show()
Men in this dataset have fewer recorded sleep disorders than women. One possible explanation for the higher sleep apnea count among women is that many of them take on caregiving responsibilities: in the earlier occupation charts, the occupation most associated with this disorder is nursing, and almost all nurses in the data are women.
Both charts display the information clearly with minimal visual clutter, but I prefer the count plot because it shows the data for males and females in a single chart.
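One detail affects how comparable the two pie charts are: `value_counts()` orders categories by frequency, so slicing a palette by position can give the same disorder different colors in the male and female pies. A minimal sketch of one way to avoid this is to reindex both counts to a fixed category order before plotting. The helper name `counts_in_fixed_order` and the toy rows are hypothetical; the three category labels are the ones shown in the data above.

```python
import pandas as pd

CATEGORY_ORDER = ["None", "Sleep Apnea", "Insomnia"]  # labels seen in the dataset

def counts_in_fixed_order(frame: pd.DataFrame, gender: str) -> pd.Series:
    """Sleep-disorder counts for one gender, always in CATEGORY_ORDER."""
    subset = frame.loc[frame["Gender"] == gender, "Sleep Disorder"]
    return subset.value_counts().reindex(CATEGORY_ORDER, fill_value=0)

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Sleep Disorder": ["None", "None", "Insomnia", "Sleep Apnea", "Sleep Apnea", "None"],
})
male_counts = counts_in_fixed_order(toy, "Male")
female_counts = counts_in_fixed_order(toy, "Female")
print(male_counts, female_counts, sep="\n")
```

Because both series now share the same index order, passing the same color list to each `ax.pie(...)` call maps each disorder to the same color in both charts.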
Let's see whether the type of sleep disorder is related to sleep duration.
sns.set_style("whitegrid")
# Create figure
plt.figure(figsize=(8, 6))
# Create violin plot
sns.violinplot(data=df, x='Sleep Disorder', y='Sleep Duration', hue='Sleep Disorder')
# Customize the plot
plt.title('Sleep Duration by Type of Sleep Disorder')
plt.xlabel('Sleep Disorder')
plt.ylabel('Sleep Duration (Hours)')
plt.show()
sns.set_style("whitegrid")
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='Sleep Disorder', y='Sleep Duration', color='lightgray')
# Add jittered points
sns.stripplot(data=df, x='Sleep Disorder', y='Sleep Duration', color='black', alpha=1, jitter=0.2)
# Customize the plot
plt.title('Sleep Duration by Type of Sleep Disorder')
plt.xlabel('Sleep Disorder')
plt.ylabel('Sleep Duration (Hours)')
plt.show()
We can clearly see that individuals with no sleep disorder average 7 to 8 hours of sleep, those with insomnia average about 6.5 hours, and those with sleep apnea show a bimodal distribution, averaging either around 6 hours or 8 hours.
I prefer the violin plot for this information because it gives a cleaner view, showing only the shape of each distribution. The box plot with an overlaid strip plot draws every individual data point, which can overwhelm viewers.
A plain scatter plot is less effective for this data because the points for each sleep disorder line up along a single vertical line, so most of them overlap and the distribution becomes hard to see.
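The ranges read from the violin and box plots can be backed by per-group quartiles. The sketch below assumes the `Sleep Disorder` and `Sleep Duration` columns used above; the toy values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Sleep Disorder": ["None", "None", "None", "Insomnia", "Insomnia", "Insomnia"],
    "Sleep Duration": [7.0, 7.5, 8.0, 6.0, 6.5, 7.0],
})

# One row per disorder, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("Sleep Disorder")["Sleep Duration"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Quartiles summarize a unimodal group well, but they can hide a two-peaked shape like the one described for sleep apnea, which is one reason to keep the violin plot alongside the numbers.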
Let's look at the correlations between the numeric variables with a heatmap
df_corr = df[['Age','Sleep Duration','Quality of Sleep','Physical Activity Level','Stress Level','Heart Rate','Daily Steps']].corr()
sns.heatmap(df_corr, annot=True, cmap='coolwarm')
plt.show()
Heatmaps visualize multiple values in a single chart, showing how closely they are correlated with each other.
- The chart reveals that sleep duration is closely correlated with sleep quality.
- Physical activity level is strongly correlated with daily steps.
- Stress level is closely correlated with heart rate.
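The strongest relationships can also be listed directly from the correlation matrix instead of being read off the colors. This is a minimal sketch: the helper name `strongest_pairs` and the toy columns are hypothetical, and with the notebook's data the input would be `df_corr`.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr: pd.DataFrame, top: int = 3) -> pd.Series:
    """Off-diagonal correlations, each pair listed once, sorted by absolute value."""
    # Keep only the cells strictly above the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(top)

# Made-up columns for illustration.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a
    "c": [4.0, 3.0, 2.0, 1.5],   # decreasing as a increases
})
top_pairs = strongest_pairs(toy.corr(), top=2)
print(top_pairs)
```

Sorting by absolute value keeps strong negative correlations in view, which a quick visual scan of a diverging color map can miss.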
Conclusion
In this dataset, the occupations with the highest rates of sleep disorders are nursing and teaching, both of which are female-dominated fields.
Sleep duration is strongly associated with sleep disorders: individuals with insomnia tend to sleep fewer hours, while those with sleep apnea tend to sleep either more or less than average.
Women in this dataset have sleep disorders more often than men.
TODO 2.1: Exploring multidimensional data with interaction
import plotly.express as px
import plotly.offline as plto
plto.init_notebook_mode()
df = pd.read_csv("/Users/yugaljagtap/Downloads/Infovis/world_bank_development_indicators.csv")
df.head() #Shows the first 5 rows of the dataset
| | country | date | agricultural_land% | forest_land% | land_area | avg_precipitation | trade_in_services% | control_of_corruption_estimate | control_of_corruption_std | access_to_electricity% | ... | multidimensional_poverty_headcount_ratio% | gini_index | birth_rate | death_rate | life_expectancy_at_birth | population | rural_population | voice_and_accountability_estimate | voice_and_accountability_std | intentional_homicides |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1960-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 50.340 | 31.921 | 32.535 | 8622466.0 | 7898093.0 | NaN | NaN | NaN |
| 1 | Afghanistan | 1961-01-01 | 57.878356 | NaN | 652230.0 | 327.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 50.443 | 31.349 | 33.068 | 8790140.0 | 8026804.0 | NaN | NaN | NaN |
| 2 | Afghanistan | 1962-01-01 | 57.955016 | NaN | 652230.0 | 327.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 50.570 | 30.845 | 33.547 | 8969047.0 | 8163985.0 | NaN | NaN | NaN |
| 3 | Afghanistan | 1963-01-01 | 58.031676 | NaN | 652230.0 | 327.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 50.703 | 30.359 | 34.016 | 9157465.0 | 8308019.0 | NaN | NaN | NaN |
| 4 | Afghanistan | 1964-01-01 | 58.116002 | NaN | 652230.0 | 327.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 50.831 | 29.867 | 34.494 | 9355514.0 | 8458694.0 | NaN | NaN | NaN |
5 rows × 50 columns
df['year'] = pd.to_datetime(df['date']).dt.year
Checking the count for each year before starting my visualization
year_counts = df['year'].value_counts().sort_index()
print(year_counts)
year
1960 266
1961 266
1962 266
1963 266
1964 266
...
2019 274
2020 274
2021 266
2022 266
2023 266
Name: count, Length: 64, dtype: int64
I am picking year 2020 for my visualization
df_2020 = df[df['year'] == 2020]
Removing all the unwanted regional and income-group aggregates from the 'country' column
regions_to_exclude = [
'Africa Eastern and Southern', 'Africa Western and Central', 'Arab World',
'Caribbean small states', 'Central Europe and the Baltics', 'Early-demographic dividend',
'East Asia & Pacific', 'East Asia & Pacific (IDA & IBRD countries)',
'East Asia & Pacific (excluding high income)', 'Euro area', 'Europe & Central Asia',
'Europe & Central Asia (IDA & IBRD countries)', 'Europe & Central Asia (excluding high income)',
'European Union', 'Fragile and conflict affected situations',
'Heavily indebted poor countries (HIPC)', 'High income', 'IBRD only',
'IDA & IBRD total', 'IDA blend', 'IDA only', 'IDA total', 'Late-demographic dividend',
'Latin America & Caribbean', 'Latin America & Caribbean (IDA & IBRD)',
'Latin America & Caribbean (excluding high income)', 'Latin America & the Caribbean (IDA & IBRD countries)',
'Least developed countries: UN classification', 'Low & middle income', 'Low income',
'Lower middle income', 'Middle East & North Africa',
'Middle East & North Africa (IDA & IBRD countries)','Middle East & North Africa (excluding high income)',
'Middle income', 'North America', 'Not classified', 'OECD members', 'Other small states',
'Pacific island small states', 'Post-demographic dividend', 'Pre-demographic dividend',
'Small states', 'South Asia', 'South Asia (IDA & IBRD)', 'Sub-Saharan Africa',
'Sub-Saharan Africa (IDA & IBRD countries)', 'Sub-Saharan Africa (excluding high income)',
'Upper middle income'
]
df_2020 = df_2020[~df_2020['country'].isin(regions_to_exclude)]
Removing the null values from the columns I want to use
df_2020 = df_2020.dropna(subset=['GDP_current_US', 'life_expectancy_at_birth', 'population', 'country'])
fig = px.scatter(df_2020,
x="GDP_current_US",
y="life_expectancy_at_birth",
size="population",
color="country",
hover_name="country",
log_x=True,
size_max=60,
title="GDP vs. Life Expectancy in 2020",
labels={"GDP_current_US": "GDP (Current US$)",
"life_expectancy_at_birth": "Life Expectancy at Birth (years)",
"population": "Population"})
fig.show()
fig.write_html("multidimensional_bubble_life_expectancy_gdp.html")
1. Hovering for detail on demand: The bubble chart gives immediate access to granular data through hover tooltips. Moving the cursor over any bubble displays that country's GDP, life expectancy, and population. Bubble size encodes population and color encodes the country, so a single chart carries several dimensions at once.
2. Filtering and highlighting via the legend: The legend serves as an interactive control panel. Each country is assigned a distinct color, and the scrollable list allows easy navigation. Clicking an entry toggles its visibility on the chart, and double-clicking an entry isolates it. This is essential for decluttering the view, either to compare a few selected countries directly or to focus on a single country of interest.
TODO 2.2: Exploring hierarchical data with interaction
df = pd.read_csv("/Users/yugaljagtap/Downloads/Infovis/world_bank_development_indicators.csv")
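# Note: several of these World Bank aggregates overlap (e.g. 'Euro area' lies within
# 'European Union'), so tile areas inside a region are not strictly additive.
regions = {
    'Africa': ['Africa Eastern and Southern', 'Africa Western and Central'],
    'Americas': ['Latin America & Caribbean', 'North America'],
    'Asia': ['East Asia & Pacific', 'South Asia'],
    # 'Central Europe and the Baltics' is a European aggregate, so it is grouped under Europe
    'Europe': ['Europe & Central Asia (excluding high income)', 'Euro area', 'European Union',
               'Central Europe and the Baltics'],
    'Middle East': ['Middle East & North Africa']
}
def get_region(sub_region):
    """Return the parent region of a World Bank aggregate, or None if unmapped."""
    for region, sub_region_list in regions.items():
        if sub_region in sub_region_list:
            return region
    return None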
region_list = [item for sublist in regions.values() for item in sublist]
df_regions = df[df['country'].isin(region_list)].copy()
df_regions['region'] = df_regions['country'].apply(get_region)
df_2020 = df_regions[df_regions['date'].str.contains('2020')]
df_cleaned = df_2020.dropna(subset=['population', 'GDP_current_US'])
fig = px.treemap(df_cleaned,
path=[px.Constant("World"), 'region', 'country'],
values='population',
color='GDP_current_US',
hover_data=['country', 'population', 'GDP_current_US'],
color_continuous_scale='Blues',
title='World Population and GDP by Region and Country (2020)')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
The treemap gives a part-to-whole comparison across the regions: tile area encodes population and color encodes GDP. The key interaction is clicking a region tile, which zooms in and redraws the chart to show the composition of that region, that is, the relative weight of its sub-regions.
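The shares behind the top-level tiles can be reproduced with a dictionary lookup and a group-by. The sketch below inverts a parent-to-children mapping into a child-to-parent one so that `Series.map` can assign regions in a single step; the shortened `regions` dictionary and the population figures in `toy` are made up for illustration.

```python
import pandas as pd

# Shortened mapping, for illustration only.
regions = {
    "Africa": ["Africa Eastern and Southern", "Africa Western and Central"],
    "Americas": ["Latin America & Caribbean", "North America"],
}
# Invert parent -> children into child -> parent for a vectorized lookup.
parent_of = {child: parent for parent, children in regions.items() for child in children}

toy = pd.DataFrame({
    "country": ["North America", "Africa Western and Central", "Latin America & Caribbean"],
    "population": [100.0, 300.0, 50.0],  # made-up numbers
})
toy["region"] = toy["country"].map(parent_of)

# Fraction of the total population that each top-level tile represents.
share = toy.groupby("region")["population"].sum() / toy["population"].sum()
print(share)
```

`Series.map` returns NaN for names that are not in the dictionary, which makes unmapped rows easy to spot before plotting.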
Portfolio Part 3: Flight Accidents Analysis
df = pd.read_csv('/Users/yugaljagtap/Downloads/Infovis/flight.csv')
df.head()
| | Unnamed: 0 | acc.date | type | reg | operator | fat | location | dmg |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 Jan 2022 | British Aerospace 4121 Jetstream 41 | ZS-NRJ | SA Airlink | 0 | near Venetia Mine Airport | sub |
| 1 | 1 | 4 Jan 2022 | British Aerospace 3101 Jetstream 31 | HR-AYY | LANHSA - Línea Aérea Nacional de Honduras S.A | 0 | Roatán-Juan Manuel Gálvez International Airpor... | sub |
| 2 | 2 | 5 Jan 2022 | Boeing 737-4H6 | EP-CAP | Caspian Airlines | 0 | Isfahan-Shahid Beheshti Airport (IFN) | sub |
| 3 | 3 | 8 Jan 2022 | Tupolev Tu-204-100C | RA-64032 | Cainiao, opb Aviastar-TU | 0 | Hangzhou Xiaoshan International Airport (HGH) | w/o |
| 4 | 4 | 12 Jan 2022 | Beechcraft 200 Super King Air | NaN | private | 0 | Machakilha, Toledo District, Grahem Creek area | w/o |
df = df.drop('Unnamed: 0', axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   acc.date  2500 non-null   object
 1   type      2500 non-null   object
 2   reg       2408 non-null   object
 3   operator  2486 non-null   object
 4   fat       2488 non-null   object
 5   location  2500 non-null   object
 6   dmg       2500 non-null   object
dtypes: object(7)
memory usage: 136.8+ KB
df.head()
| | acc.date | type | reg | operator | fat | location | dmg |
|---|---|---|---|---|---|---|---|
| 0 | 3 Jan 2022 | British Aerospace 4121 Jetstream 41 | ZS-NRJ | SA Airlink | 0 | near Venetia Mine Airport | sub |
| 1 | 4 Jan 2022 | British Aerospace 3101 Jetstream 31 | HR-AYY | LANHSA - Línea Aérea Nacional de Honduras S.A | 0 | Roatán-Juan Manuel Gálvez International Airpor... | sub |
| 2 | 5 Jan 2022 | Boeing 737-4H6 | EP-CAP | Caspian Airlines | 0 | Isfahan-Shahid Beheshti Airport (IFN) | sub |
| 3 | 8 Jan 2022 | Tupolev Tu-204-100C | RA-64032 | Cainiao, opb Aviastar-TU | 0 | Hangzhou Xiaoshan International Airport (HGH) | w/o |
| 4 | 12 Jan 2022 | Beechcraft 200 Super King Air | NaN | private | 0 | Machakilha, Toledo District, Grahem Creek area | w/o |
df['acc.date'] = pd.to_datetime(df['acc.date'], errors='coerce')
def clean_fatalities(fat_str):
    """Convert a raw fatality string to an integer, or NaN if it cannot be parsed."""
    if not isinstance(fat_str, str):
        return np.nan
    fat_str = fat_str.strip()
    # Placeholder symbols that carry no usable count
    if fat_str in ['?', 'c', '`']:
        return np.nan
    # Values such as '4+1' are summed
    if '+' in fat_str:
        try:
            return sum(int(i) for i in fat_str.split('+'))
        except (ValueError, TypeError):
            return np.nan
    try:
        return int(fat_str)
    except (ValueError, TypeError):
        return np.nan
df['fat'] = df['fat'].apply(clean_fatalities)
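A few sample inputs make the parsing rules concrete. The block below restates the same rules in compact form so that it runs on its own (using `math.nan` instead of `np.nan`); the sample strings are hypothetical and not taken from the dataset. Entries of the form 'a+b' presumably combine two separate counts, which the function simply adds.

```python
import math

def clean_fatalities(fat_str):
    """Same rules as the notebook function, restated so this block is self-contained."""
    if not isinstance(fat_str, str):
        return math.nan
    fat_str = fat_str.strip()
    if fat_str in ["?", "c", "`"]:
        return math.nan
    try:
        # "4+1" is read as 4 + 1; a plain "3" is a one-element sum.
        return sum(int(part) for part in fat_str.split("+"))
    except ValueError:
        return math.nan

# Hypothetical sample inputs.
samples = ["0", " 12 ", "4+1", "?", "abc", None]
parsed = [clean_fatalities(s) for s in samples]
print(parsed)
```

The first three samples parse to integers, while the placeholder symbol, the non-numeric text, and the non-string value all become NaN and can later be removed with `dropna(subset=['fat'])`.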
df['acc.date'] = pd.to_datetime(df['acc.date'], errors='coerce')
df['year'] = df['acc.date'].dt.year
accidents_per_year = df['year'].value_counts().sort_index()
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(10, 5))
accidents_per_year.plot(kind='line', marker='o', ax=ax)
ax.set_title('Number of Accidents per Year', fontsize=16)
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Number of Accidents', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
We can see that in 2019 the number of accidents peaked, reaching over 580 incidents. After that, the sharp decline in 2020 might be due to the COVID-19 pandemic, as it led to a massive reduction in commercial and private flights worldwide. With far fewer planes in the air, a corresponding decrease in the absolute number of accidents is expected.
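The size of a drop such as the one after the peak year is easier to judge as a year-over-year percentage. A minimal sketch with made-up counts (the real series would be `accidents_per_year`):

```python
import pandas as pd

# Made-up yearly counts for illustration.
counts = pd.Series({2018: 400, 2019: 500, 2020: 300})

# Percentage change relative to the previous year; the first year has no predecessor.
change_pct = counts.pct_change().mul(100).round(1)
print(change_pct)
```

A relative change still reflects absolute accident counts only; relating it to the number of flights would need traffic data that this dataset does not contain.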
df_operator = df.dropna(subset=['operator', 'fat']).copy()
top_20_operators_by_accidents = df_operator['operator'].value_counts().head(20)
plt.figure(figsize=(12, 8))
sns.barplot(y=top_20_operators_by_accidents.index, x=top_20_operators_by_accidents.values, hue=top_20_operators_by_accidents.index, palette='flare', legend=False)
plt.title('Top 20 Operators by Number of Accidents', fontsize=16)
plt.xlabel('Number of Accidents', fontsize=12)
plt.ylabel('Operator', fontsize=12)
plt.tight_layout()
plt.show()
This chart shows that private operators account for the highest number of accidents. Major commercial airlines also appear frequently, which is likely a consequence of their high volume of operations. A significant number of accidents also involve operators recorded as unknown.
df_aircraft = df.dropna(subset=['type']).copy()
top_10_aircraft = df_aircraft['type'].value_counts().head(10)
df_treemap = top_10_aircraft.reset_index()
df_treemap.columns = ['type', 'count']
df_treemap['label'] = df_treemap.apply(lambda row: f"{row['type']}<br>({row['count']})", axis=1)
total_top_10_accidents = df_treemap['count'].sum()
df_treemap['percentage'] = (df_treemap['count'] / total_top_10_accidents) * 100
fig = px.treemap(
df_treemap,
path=[px.Constant("All Aircraft"), 'label'],
values='count',
color='count',
custom_data=['percentage'],
color_continuous_scale='purples',
title='Aircraft Types by Number of Accidents'
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.update_traces(
hovertemplate='<b>%{label}</b><br>Accidents: %{value}<br>Share of Top 10: %{customdata[0]:.2f}%<extra></extra>'
)
fig.show()
This chart clearly shows that smaller, multi-purpose aircraft are involved in a higher number of incidents than major commercial jets. The Cessna 208B Grand Caravan stands out the most.
import plotly.express as px
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
geolocator = Nominatim(user_agent="flight_accident_analyzer", timeout=10)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1, error_wait_seconds=10)
unique_locations = df['location'].unique()
location_coords = {}
for loc in unique_locations:
    try:
        location_result = geocode(loc)
        if location_result:
            location_coords[loc] = (location_result.latitude, location_result.longitude)
        else:
            location_coords[loc] = (None, None)
    except Exception as e:
        print(f"Error geocoding '{loc}': {e}")
        location_coords[loc] = (None, None)
df['latitude'] = df['location'].map(lambda loc: location_coords.get(loc, (None, None))[0])
df['longitude'] = df['location'].map(lambda loc: location_coords.get(loc, (None, None))[1])
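With a one-second delay per request, geocoding every unique location takes a long time, and the work is repeated on each notebook run. One option is to cache results in a JSON file. The sketch below is an assumption about how this could be done, not part of the original notebook: `geocode_with_cache`, the cache file, and `fake_lookup` are hypothetical, and `lookup` stands in for a wrapper around the rate-limited `geocode` call that returns `(lat, lon)` or `None`.

```python
import json
import os
import tempfile
from pathlib import Path

def geocode_with_cache(locations, lookup, cache_path):
    """Resolve each location once, reusing results stored in a JSON file.

    `lookup` is any callable returning (lat, lon) or None.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for loc in locations:
        if loc not in cache:  # only query names not seen before
            result = lookup(loc)
            cache[loc] = list(result) if result else [None, None]
    path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")
    return cache

# Stand-in for the real geocoder so the example runs offline.
calls = []
def fake_lookup(name):
    calls.append(name)
    return (1.0, 2.0) if name == "Known Airport" else None

cache_file = os.path.join(tempfile.mkdtemp(), "coords.json")
first = geocode_with_cache(["Known Airport", "Nowhere"], fake_lookup, cache_file)
second = geocode_with_cache(["Known Airport", "Nowhere"], fake_lookup, cache_file)
print(first, len(calls))
```

The second call reads everything from the file and makes no new lookups. Failed lookups are cached as well, so names that cannot be resolved are not retried on every run; deleting the file forces a fresh attempt.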
df_geo = df.dropna(subset=['latitude', 'longitude']).copy()
df_geo.dropna(subset=['fat'], inplace=True)
damage_mapping = {
'sub': 'Substantial',
'w/o': 'Written Off',
'non': 'None',
'unk': 'Unknown',
'min': 'Minor'
}
df_geo['damage_full'] = df_geo['dmg'].map(damage_mapping).fillna('Unknown')
fig = px.scatter_geo(
df_geo,
lat='latitude',
lon='longitude',
hover_name='location',
size='fat',
color='damage_full',
projection="natural earth",
title="Interactive Map of Flight Accidents",
hover_data=['operator', 'type', 'acc.date'],
width=1100,
height=700,
labels={'damage_full': 'Damage Severity'}
)
fig.show()
The map reveals that while accidents occurred globally, they are concentrated in major aviation hubs. The size of each bubble depends on the number of fatalities. Furthermore, you can hover over any point to investigate the specific details of each incident, including the exact number of fatalities, the operator, and the type of aircraft involved.
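Because marker size is tied to the fatality count, accidents with zero fatalities get a size of zero and may be invisible or barely visible on the map. A minimal sketch of one workaround, assuming a cleaned numeric `fat` column, is to plot an offset size while keeping the true count for the hover text; the column name `marker_size` and the toy counts are hypothetical.

```python
import pandas as pd

# Made-up fatality counts for illustration.
toy = pd.DataFrame({"fat": [0, 3, 120]})

# Offset by one so zero-fatality accidents still get a small marker.
toy["marker_size"] = toy["fat"] + 1

# Share of rows that would otherwise have size 0.
zero_share = (toy["fat"] == 0).mean()
print(toy, zero_share, sep="\n")
```

In the map call this would mean passing `size='marker_size'` and adding `'fat'` to `hover_data`, so the tooltip still reports the real number.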